Designing Special Post-Processing Rules for SVM-Based Chinese Word Segmentation

Authors

  • Muhua Zhu
  • Yilin Wang
  • Zhenxing Wang
  • Huizhen Wang
  • Jingbo Zhu
Abstract

We participated in the Third International Chinese Word Segmentation Bakeoff. Specifically, we evaluated our Chinese word segmenter NEUCipSeg in the closed track on all four corpora, namely Academia Sinica (AS), City University of Hong Kong (CITYU), Microsoft Research (MSRA), and University of Pennsylvania/University of Colorado (UPENN). The basic segmenter is built on Support Vector Machines (SVMs) and treats Chinese word segmentation as a character-based tagging problem. In addition, we propose post-processing rules designed specifically around the properties of the basic segmenter's output. Our system achieved good ranks on all four corpora.

1 SVM-based Chinese Word Segmenter

We built our segmentation system following (Xue and Shen, 2003), regarding Chinese word segmentation as a problem of character-based tagging. Instead of Maximum Entropy, we used Support Vector Machines as an alternative. SVMs are a state-of-the-art learning algorithm, owing their success mainly to their ability to control an upper bound on the generalization error and to their smooth integration with kernel methods. See (Vapnik, 1995) for details. We adopted svm-light (http://svmlight.joachims.org/) as the specific implementation of the model.

1.1 Problem Formalization

By formalizing Chinese word segmentation as a character-based tagging problem, we assigned each character to one and only one of four classes: word-prefix, word-suffix, word-stem and single-character. For example, given the two-word sequence "东南亚 人", the Chinese words for "Southeast Asia (东南亚) people (人)", the character "东" is assigned to the category word-prefix, indicating the beginning of a word; "南" is assigned to the category word-stem, indicating a position in the middle of a word; "亚" belongs to the category word-suffix, marking the end of a word; and finally "人" is assigned to the category single-character, indicating that the character is a word by itself.
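The four-class scheme described in Section 1.1 can be illustrated with a small sketch. This is not the NEUCipSeg implementation: the function names `words_to_tags` and `tags_to_words` are hypothetical, and the code only shows how a segmented sentence maps to per-character labels and how a label sequence maps back to words. No classifier is involved.

```python
from typing import List, Tuple

# The four classes named in the text.
PREFIX = "word-prefix"
STEM = "word-stem"
SUFFIX = "word-suffix"
SINGLE = "single-character"


def words_to_tags(words: List[str]) -> List[Tuple[str, str]]:
    """Convert a segmented sentence into (character, class) pairs.

    Hypothetical helper for illustration only.
    """
    tagged = []
    for word in words:
        if not word:
            continue  # ignore empty tokens
        if len(word) == 1:
            tagged.append((word, SINGLE))
            continue
        tagged.append((word[0], PREFIX))
        for ch in word[1:-1]:
            tagged.append((ch, STEM))
        tagged.append((word[-1], SUFFIX))
    return tagged


def tags_to_words(tagged: List[Tuple[str, str]]) -> List[str]:
    """Rebuild words from (character, class) pairs.

    Assumption: an ill-formed sequence (for example a prefix that is
    never closed by a suffix) is handled by closing the open word.
    """
    words, buffer = [], ""
    for ch, tag in tagged:
        # A new word starts here, so flush anything left open.
        if tag in (PREFIX, SINGLE) and buffer:
            words.append(buffer)
            buffer = ""
        buffer += ch
        # A word ends here.
        if tag in (SUFFIX, SINGLE):
            words.append(buffer)
            buffer = ""
    if buffer:
        words.append(buffer)
    return words


example = words_to_tags(["东南亚", "人"])
for ch, tag in example:
    print(ch, tag)
```

Applied to the example from the text, "东" is labeled word-prefix, "南" word-stem, "亚" word-suffix and "人" single-character, and `tags_to_words` recovers the original two words from those labels.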
1.2 Feature Templates

We utilized four of the five basic feature templates suggested in (Low et al., 2005), described as


Similar resources

Context-Based Chinese Word Segmentation using SVM Machine-Learning Algorithm without Dictionary Support

This paper presents a new machine-learning Chinese word segmentation (CWS) approach, which defines CWS as a break-point classification problem; the break point is the boundary of two subsequent words. Further, this paper exploits a support vector machine (SVM) classifier, which learns the segmentation rules of the Chinese language from a context model of break points in a corpus. Additionally, ...


Do Chinese Readers Follow the National Standard Rules for Word Segmentation during Reading?

We conducted a preliminary study to examine whether Chinese readers' spontaneous word segmentation processing is consistent with the national standard rules of word segmentation based on the Contemporary Chinese language word segmentation specification for information processing (CCLWSSIP). Participants were asked to segment Chinese sentences into individual words according to their prior knowl...


Word Segmenter for Chinese Micro-blogging Text Segmentation - Report for CIPS-SIGHAN'2014 Bakeoff

This paper presents our system for the CIPS-SIGHAN-2014 bakeoff task of Chinese word segmentation. This system adopts a character-based joint approach, which combines a character-based generative model and a character-based discriminative model. To further improve cross-domain performance, an external dictionary is employed. In addition, pre-processing and post-processing rules are utilize...


High OOV-Recall Chinese Word Segmenter

For the Chinese word segmentation competition held at the first CIPS-SIGHAN joint conference, we applied a subword-based word segmenter using CRFs and extended the segmenter with OOV words recognized by Accessor Variety. Moreover, we proposed several post-processing rules to improve the performance. Our system achieved promising OOV recall among all the participants.


ISCAS: A Cascaded Approach for CIPS-SIGHAN Micro-Blog Word Segmentation Bakeoff 2012 Track

State-of-the-art Chinese word segmentation systems have achieved high performance on well-formed long documents. However, segmenting microblog text is difficult due to noise and the OOV problem. In this paper, we present a Chinese Micro-Blog Segmentation system for the CIPS-SIGHAN Word Segmentation Bakeoff 2012 track. The proposed system adopts a cascaded approach which conta...




Publication date: 2006